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VECTORIZATION  AND  IMPLEMENTATION  OF  AN  EFFICIENT 
MULTIGRID  ALGORITHM  FOR  THE  SOLUTION  OF 
ELLIPTIC  PARTIAL  DIFFERENTIAL  EQUATIONS 


INTRODUCTION 

Elliptic  partial  differential  equations  arise  in  a  wide  variety  of 
problems  in  computational  physics  and  engineering.  These  applications 
include,  for  example,  the  calculation  of  incompressible  flows  in 
hydrodynamics,  electromagnetic  potentials  in  magnetohydrodynamics,  and 
molecular  states  in  quantum  chemistry.  Elliptic  equations  also  arise 
from  time-implicit  formulations  of  diffusion  problems,  after  discretization 
of  the  time  variable.  For  applications  involving  time-dependent  physical 
systems,  the  equations  must  be  solved  over  the  spatial  domain  at  each 
times tep.  Over  the  duration  of  a  typical  simulation,  this  amounts  to 
solving  thousands  of  equations  in  hundreds  or  thousands  of  unknowns. 
Executing  this  imposing  task  in  a  practical  fashion  calls  for  the 
development  and  use  of  accurate  and  highly  efficient  numerical  techniques. 
The  importance  of  elliptic  equations  in  applied  mathematics  has  led  to 
extensive  and  varied  efforts  in  this  direction  [cf.  Schultz  1981]. 

During  the  last  decade,  much  attention  has  been  devoted  to  the 

development  of  a  new  class  of  methods,  multilevel  adaptive  techniques  [cf. 

Brandt  1977],  for  solving  many  types  of  numerical  problems.  The  general 

principles  of  multilevel  techniques  apply  equally  to  finite-difference  and 

finite— element  approaches  to  solving  partial  differential  equations,  as  well 

as  to  problems  not  associated  with  such  equations,  and  can  be  summarized  as 

follows.  As  for  all  numerical  methods,  one  begins  by  specifying  a 

discretization  of  the  original  (continuous)  problem,  and  then  seeks  the 

solution  to  this  discretized  problem  in  the  appropriate  finite-dimensional 
Manuscript  approved  November  1,  1984. 
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(physical  or  function)  space.  The  essence  of  the  multigrid  approach  is  to 
establish  a  sequence  of  smaller,  perhaps  nested,  auxiliary  spaces,  and  to 
use  solutions  obtained  in  these  smaller  spaces  to  approximate  the  desired 
solution  in  the  largest  space.  The  solution  in  any  space  can  be  improved  by 
combining  relaxation  iterations  in  that  space,  which  smooth  the  fine-scale 
errors,  with  solving  correction  problems  using  the  smaller  spaces,  which 
reduce  the  coarse-scale  errors.  The  smaller  spaces  have  geometrically  fewer 
unknowns,  and  so  require  far  less  computational  effort  to  yield  a  solution 
than  does  the  largest  space.  The  final  result  is  a  substantial  savings  in 
the  work  required  to  solve  the  problem. 

Multigrid  algorithms  for  elliptic  boundary-value  problems  have  been 
proposed  and  analyzed  by  several  authors  [cf.  Douglas  1984,  and  references 
cited  therein].  Douglas  [1982]  found  a  particularly  efficient  multigrid 
algorithm  for  solving  elliptic  problems,  one  which  in  numerical  tests 
yielded  solutions  with  accuracy  comparable  to  that  obtained  by  several 
alternative  algorithms,  for  a  smaller  amount  of  effort.  His  algorithm  has 
been  implemented  in  finite-difference  form  in  a  manner  suitable  for  use  in 
large-scale  numerical  simulations  on  a  high-speed  vector  computer.  In  this 
report  I  summarize  the  principles  guiding  its  implementation,  discuss  the 
modifications  to  the  algorithm  required  by  these  principles,  and  present  the 
results  of  numerical  tests  of  the  method. 
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THE  MULTIGRID  ALGORITHM 


Suppose  we  are  given  an  elliptic  equation  to  solve  on  the  unit 
square  [0,l]x[0,l]  in  the  (x,y)  plane*  Lay  a  uniform  N*N  mesh  on  the  unit 
square,  and  denote  the  mesh  spacing  by  h  =  1/N.  After  finite-differencing 
the  differential  equation  and  incorporating  boundary  conditions,  the  problem 
is  reduced  to  solving  the  matrix  equation 

Ah  uh  =  f\  (1) 

where  A^  is  the  matrix  of  coefficients,  u1  is  the  sought  solution,  and  f*1  is 
the  inhomogeneous  term,  defined  on  the  discrete  domain  D  .  We  might  attempt 
to  solve  Eq.  ( 1 )  by  a  direct  method  such  as  sparse  Gaussian  elimination,  or 
by  an  iterative  method  such  as  Gauss-Seidel  relaxation.  For  the  direct, 
approach,  the  operation  count  is  0(N3)  to  obtain  a  solution  for  the  N2 
unknowns  in  Eq.  (1),  and  thus  rapidly  becomes  very  large  for  large  problems. 
For  the  iterative  approach,  the  operation  count  is  just  proportional  to  the 
number  of  unknowns;  owing  to  the  slow  relaxation  of  the  coarse-scale  errors 
in  the  solution,  however,  many  iterations  must  be  done  in  order  to  obtain  an 
accurate  result,  and  the  operation  count  is  again  very  large,  because  the 
coefficient  of  N2  is  large.  Multigrid  techniques  take  advantage  of  the 
rapid  reduction  in  effort  expended  by  direct  methods  as  the  number  of 
unknowns  is  reduced,  and  of  the  rapid  relaxation  of  fine-scale  errors  in  the 
solution  by  iterative  methods,  by  utilizing  a  combination  of  the  two. 
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A  simple  multigrid  algorithm  for  solving  this  elliptic  problem  is  the 

following.  Choose  N  to  be  even,  and  lay  two  uniform  grids  on  the  unit 

N  N 

square:  one  NxN  with  mesh  spacing  h,  the  other  with  mesh  spacing  2h. 

h  2h 

On  the  fine  grid  D  we  have  Eq.  (1),  while  on  the  coarse  grid  D  we  have 


,2h  2h  ,2h 
A  u  =  f 


(2) 


We  begin  by  solving  Eq.  (2)  using  a  direct  method.  Since  the  number  of 
unknowns  is  smaller  by  a  factor  of  4,  this  task  requires  a  factor  of  23  =  8 
less  work  than  does  the  same  task  on  D  .  We  then  interpolate  the  solution 
Uh  to  Eq.  (2)  onto  D^, 


=  ih  u2h 

12h  U  ! 


(3) 


where  I  is  an  interpolation  operator.  The  result  U  possesses  fine-scale 

2h  2h 

errors  due  to  the  lack  of  fine-scale  information  in  A  and  f  ,  and  thus 

9  Vi 

U  ,  as  well  as  any  introduced  by  the  interpolation  process.  We  therefore 
do  a  small  number  of  relaxation  iterations  on  U*1  to  smooth  these  fine-scale 
errors ,  and  obtain 


uh  =  R  uh, 


(4) 


where  R  is  a  relaxation  operator,  U*1  is  our  first  approximation  u  ^  ^  to 
the  solution  to  Eq.  (1). 
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We  can  improve  the  approximation  to  u  by  solving  a  correction  problem, 
h  h  h 

Define  the  residual  r  and  the  correction  v  on  D  by  the  expressions 


h  ..h  h  ~h 
r  =  f  -  A  U 


(5) 


and 


h  h  ~h 

v  =  u  -  U  , 


(6) 


respectively,  so  that  v  satisfies 


Ah  h  h 

A  v  =  r  . 


(7) 


Notice  that  Eq.  (7)  has  the  same  form  as  Eq.  (1).  We  now  use  a  projection 

h  _2h 

operator  P  to  project  r  onto  D  , 


2h  2h  h 

r  =  P.  r  , 


(8) 


and  solve  directly  the  coarse-grid  equivalent  of  Eq.  (7), 


.  2h  2h  2h 

A  v  =  r  , 


(9) 


to  obtain  V2h.  We  apply  the  interpolation  operator  I  to  the  correction  V2h, 

h  ~h 

and  add  the  result  V  to  U  .  Smoothing  the  fine-scale  errors  by  applying 
the  relaxation  operator  R,  we  obtain 


uM2)  =  R(Uh  +  Vh), 


(10) 
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our  second  approximation  to  the  desired  solution  on  the  fine  grid. 

This  simple  algorithm  is  a  2-level,  2-cycle  algorithm  for  solving 
elliptic  boundary  value  problems.  It  can  readily  be  extended  to  make  use  of 
additional  coarse  grids  with  mesh  spacings  4h,  8h,  etc.  (more  levels  can  be 
utilized)  or  to  solve  additional  correction  problems  (more  cycles  can  be 
done),  or  both.  Further  options  to  be  exercised  in  the  implementation  of  a 

*/ 

multigrid  algorithm  include  the  selection  of  a  direct  solver  and  of  a 
relaxation  technique,  and  the  specification  of  interpolation  and  projection 
schemes.  The  best  choices  for  many  of  these  options  will  in  general  depend 
upon  the  characteristics  of  the  problem  to  be  solved  (the  properties  of  the 
coefficients  in  the  equation,  the  number  of  spatial  dimensions,  the  number 
of  unknowns,  the  accuracy  required  in  the  solution)  and  the  characteristics 
of  the  computing  machinery  to  be  used  (capabilities  for  vector  or  parallel 
processing,  relative  costs  of  execution  and  core  storage  time). 

Several  multilevel  algorithms  for  elliptic  boundary  value  problems  have 
been  suggested  and  analyzed  by  various  authors,  for  both  finite-element  and 
finite-difference  discretizations.  I  show  schematically  in  Fig.  1  a  4- 
level,  2-cycle  example  of  each  of  four  algorithms.  The  first  is  due  to 
Federenko  [1961],  the  second  is  due  to  Brandt  [1977],  the  third  is  an 
iterative  extension  by  Brandt  [1978]  of  a  2— level  algorithm  also  due  to 
Federenko  [1961],  and  the  fourth  is  a  hybrid  of  the  latter  two  schemes 
investigated  by  Douglas  [1982].  The  general  N-cycle  hybrid  algorithm 
consists  of  N-l  cycles  of  the  second  algorithm  followed  by  a  half-cycle  of 
the  third.  The  first  two  algorithms  are  different  in  that  in  the  first  case 
smoothing  iterations  are  done  after  projection,  while  in  the  second  they  are 
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ds  =  direct  solve 
m  =  m  smoothing  iterations 
1  =  one  smoothing  iteration 

0  =  no  smoothing  iterations 

b  =  bilinear  interpolation 
c  =  cubic  interpolation 


FIGURE  1.  Four  multigrid  algorithms  for  elliptic  boundary  value 
problems,  illustrated  by  4-level,  2-cycle  examples. 
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done  after  interpolation;  they  are  alike  in  that  they  are  both  recursive, 
i.e.,  they  cycle  at  all  levels  save  the  lowest  one.  The  third  algorithm 
allows  smoothing  after  both  projection  and  interpolation,  and  is  not 
recursive,  cycling  occurring  only  at  the  highest  level.  Douglas  [1982, 

1984]  has  shown  that  the  recursive  algorithms  are  optimal  order  algorithms, 
i.e.,  the  operation  count  for  a  solve  to  truncation  error  is  proportional  to 
the  number  of  unknowns ,  if  the  condition 

2  <  C  <  26  (L1) 

is  satisfied,  where  C  is  the  number  of  cycles  and  6  is  the  number  of  space 

£ 

dimensions  in  the  problem.  For  the  case  C  =  2  (e.g.,  a  1-dimensional 

problem  solved  by  a  2-cycle  algorithm),  the  operation  count  for  N  unknowns 

is  O(NlogN). 

Douglas  [1982]  carried  out  extensive  numerical  experiments  on  these 
algorithms,  using  sparse  Gaussian  elimination  for  the  direct  solves  and 
Gauss-Seidel  relaxation  for  the  smoothing  iterations.  He  considered  various 
schemes  for  the  interpolation  of  the  solutions  and  corrections  and  for  the 
projection  of  the  residuals.  His  results  indicate  that  a  combination  of 
high-order  (cubic)  interpolation  of  the  solutions  and  low-order  (bilinear) 
interpolation  of  the  corrections  is  optimal;  high-order  interpolation  of  the 
corrections  often  degrades  the  accuracy  of  the  resulting  'corrected' 
solution,  and  additional  smoothing  iterations  are  necessary  to  compensate. 

He  found  it  advantageous  to  smooth  the  residuals  on  projection  by  using  a 
weighted  average  of  the  neighboring  values  on  the  finer  grid.  Both  of  these 
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findings  reflect  the  purpose  of  the  correction  problem,  which  is  to  improve 
the  accuracy  in  the  coarse-scale  features  of  the  solution,  leaving  the  fine- 
scale  features  to  the  relaxation  scheme.  Douglas  further  found  that  the 
hybrid  algorithm  yielded  solutions  with  accuracy  comparable  to  that  obtained 
by  Federenkofs  [1961]  and  Brandt Ts  [1977]  algorithms,  for  a  smaller  amount 
of  work.  By  comparison,  sparse  Gaussian  elimination  on  the  finest  grid 
yielded  the  same  accuracy  at  a  substantial  increase  in  computational 
resources:  a  factor  of  10  in  operations  and  a  factor  of  4  in  memory,  for  a 

problem  in  1000  unknowns.  Gauss-Seidel  relaxation  to  convergence  on  the 
finest  grid  required  an  increase  of  a  factor  of  100  in  operations  over  the 
multigrid  algorithms  for  that  problem. 


9 


VECTORIZATION  OF  THE  ALGORITHM 


Efficient  computational  techniques  make  optimal  use  of  computer 

resources  by  minimizing  some  machine-dependent  combination  of  execution  and 

core  storage  time.  Given  that  a  consideration  of  the  tradeoffs  between 

execution  and  core  storage  for  a  particular  technique  establishes  a  minimum 

memory  requirement  for  its  efficient  use,  the  task  which  remains  is  to 

minimize  the  execution  time  subject  to  this  memory  constraint.  For  a  scalar 

(sequential)  processor,  which  can  execute  one  instruction  on  one  scalar  or 

pair  of  scalars  at  a  time,  this  task  is  essentially  identical  to  that  of 

minimizing  the  total  number  of  arithmetic  operations  performed.  For  a 

vector  processor,  which  can  execute  a  single  instruction  on  all  of  the 

elements  of  a  vector  or  pair  of  vectors  serially,  the  direct  correspondence 

between  execution  time  and  number  of  operations  fails,  owing  to  the  marked 

difference  in  speed  of  execution  (an  order  of  magnitude  or  greater)  of 
> 

vector  code  vis-a-vis  scalar  code.  This  circumstance  places  a  premium  on 
the  use  of  vectorizable  instructions,  even  at  the  expense  of  a  substantial 
increase  in  the  total  number  of  operations  performed.  Similar 
considerations  apply  to  an  n-component  parallel  processor,  which  can  execute 
n  different  instructions  on  n  independent  vectors  or  pairs  of  vectors 
simultaneously;  a  larger  operation  count  can  be  borne  by  a  more  highly 
parallel  algorithm,  at  a  net  reduction  in  execution  time. 

Vectorization  of  several  of  the  component  operations  of  the  multigrid 
algorithms,  specifically  the  generation  of  the  matrices,  the  interpolation 
of  the  solutions  and  corrections,  and  the  calculation  and  projection  of  the 
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residuals,  is  straightforward.  The  direct  solver  is  employed  only  on  very 
coarse  grids,  to  solve  problems  with  few  unknowns.  Its  operation  count  is 
therefore  usually  a  negligible  fraction  of  the  total,  and  furthermore  the 
vectors  on  which  it  operates  are  generally  too  short  for  a  significant  gain 
in  speed  by  the  processor,  so  that  vectorization  is  not  an  overriding 
consideration  in  its  selection.  The  largest  share  of  the  arithmetic 
operations  executed  by  a  multigrid  algorithm  is  performed  by  the  relaxation 
method,  primarily  on  the  finer  grids  possessing  many  unknowns.  The 
advantages  accruing  to  extensive  vector  processing  thus  figure  prominently 
in  the  selection  of  a  smoothing  technique.  A  discussion  of  these 
considerations  for  parallel  processing  applications  has  been  given  by  Brandt 


[1981]. 


A  classical  relaxation  method  eminently  suited  for  use  on  a  vector 
processor  is  Jacobi  relaxation.  Given  the  set  of  simultaneous  equations’ 


(12) 


where  M,  0,  and  an  estimate  Xn,  the  iteration  reads 


(13) 


A  more  rapidly  convergent  method  is  Gauss-Seidel  relaxation,  in  which  the 


(14) 
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The  latter  iteration,  however,  does  not  vectorize  because  the  operands  in 
the  calculation  of  depend  on  the  results  of  the  same  instruction  carried 
out  for 

A  method  which  realizes  an  increase  in  the  convergence  rate  by  using 
updated  values  of  X,  but  which  vectorizes  completely,  is  red-black  or 
checkerboard  relaxation.  In  this  scheme,  Eq.  (13)  is  first  applied  to  all 
'red'  (even  k)  points;  then  for  all  'black'  (odd  k)  points,  calculate 


E  >xf 1  -  hhe  . 

even  l  odd  i 


1 ' 


(15) 


A  potential  shortcoming  of  red-black  relaxation  is  that  the  retrieval  and 
storage  of  the  operands  and  the  results  is  effectively  reduced  by  about  a 
factor  of  two  due  to  the  on-off  pattern  of  the  iteration.  If  the  machine 
has  a  narrow  memory  bandwidth,  the  execution  speed  may  be  reduced 
sufficiently  below  that  of  Jacobi  relaxation  to  negate  most  of  the  gain  in 
the  rate  of  convergence.  This  inefficiency  can  be  circumvented  by  storing 
all  Tredr  points  contiguously  in  memory,  and  likewise  with  all  ’black1 
points,  but  at  a  substantial  increase  in  complexity  in  other  components  of 
the  solver.  The  simplicity  and  the  ease  of  implementation  of  the  Jacobi 
relaxation  scheme  may  in  many  cases  make  it  the  superior  method. 
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IMPLEMENTATION  OF  THE  ALGORITHM 


In  numerical  hydrodynamics  calculations  using  finite  differences,  it  is 
advantageous  after  subdividing  the  domain  of  the  calculation  to  define  the 
fluid  variables  at  points  interior  to  the  subdomains  (i.e.,  at  the  cell 
centers),  as  shown  in  Fig,  2(a)  for  a  2 -dimensional  problem.  Fluxes  are 
defined  on  the  boundaries  of  the  subdomains  (i.e.,  at  the  cell  interfaces) 
which  describe  the  transport  of  mass,  momentum,  and  energy  from  each  cell  to 
its  neighbors.  With  the  exception  of  the  fluxes  through  the  boundary  of  the 
system,  no  net  change  in  the  total  mass,  momentum,  and  energy  of  the  system 
is  effected,  since  a  flux  loss  by  one  cell  is  balanced  by  a  flux  gain  by  its 
neighbor.  The  conservation  of  mass,  momentum,  and  energy  required  by  the 
laws  of  hydrodynamics  is  therefore  achieved  by  the  numerical  solutions. 

When  applying  multigrid  techniques  to  solve  elliptic  boundary-value 
problems,  on  the  other  hand,  it  is  advantageous  to  define  the  dependent 
variable  on  the  cell  interfaces,  as  shown  in  Fig.  2(b).  The  coarse-grid 
points  are  then  coincident  with  points  on  the  finest  grid,  and  this 
coincidence  simplifies  the  intergrid  transfers  (interpolations  and 
projections)  required  by  the  multigrid  algorithm.  Douglas  [1982]  used 
such  nested  grids  to  obtain  his  numerical  results. 

Modifications  to  the  intergrid  transfers  in  Douglas T  multigrid 
algorithm  have  been  made  to  accomodate  the  change  from  cell-interface  to 
cell-center  locations  for  defining  the  dependent  variable.  Fig.  3  shows  the 
spatial  relationships  between  nine  coarse-grid  points  and  sixteen  fine-grid 
points,  and  some  of  their  associated  cell  interfaces.  None  of  the  coarse- 
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FIGURE  2.  Subdivision  of  a  2-dimensional  rectangular  domain  into  32  =  9 

subdomains.  For  hydrodynamic  calculations,  dependent  variables 
are  defined  most  conveniently  at  points  interior  to  the 
subdomains,  at  cell  centers  (a).  For  multigrid  solutions  to 
elliptic  problems,  dependent  variables  are  defined  most 
conveniently  at  cell  interfaces  (b). 
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2i,2j 


2i,2j-1 


FIGURE  3.  Spatial  relationships  between  coarse-grid  points  (t)  and  fine- 
grid  points  (•),  and  between  their  associated  cell  interfaces. 
The  indices  of  the  one  coarse-grid  point  and  the  four  fine-grid 
points  in  the  center  of  the  figure  are  shown. 
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and  fine-grid  points  are  coincident.  The  fine-grid  mesh  spacings  in  the  x- 
and  y-directions  are  denoted  by  Ax  and  Ay,  respectively. 

The  interpolation  formulae  are  easily  derived.  Define  for  a  given  pair 

xo>  yo 


Fa&  =  F(X0  +  aAx,y0  +  SAy),  (16) 

for  any  function  F(x,y).  A  Taylor  series  expansion  yields 

Fl/2  1/2  =  ~16  F°°  +  76  F22  +  "32  ^F20  +  F02  “  F“20  ~  F0“2^ 

< 

to  0(Ax3 ,  Ay3),  and 


f1/2  1/2 


iF 


00  +  4  ^F20  +  F02 ^ 


(18) 


to  0(Ax2,  Ay2).  The  quadratic  interpolation  of  Eq.  (17)  is  applied  to  the 
solutions  U,  while  the  bilinear  interpolation  of  Eq.  (18)  is  applied  to  the 
corrections  V.  Thus,  for  example,  referring  to  Fig.  3,  at  the  fine-grid 
point  (2i,2j)  we  have 


j*1  =  —  +  —  u2h 

2i , 2j  16  Ui,j  16  Ui+1 ,  j+1 


+  3_  (u2h  2h  _  2h  _  2h 

32  C Ui+l,j  i,j+l  i-l,j 


(19) 


and 


21,23 


T  y2h-  +  T  (V2+,  •  +  V?h.  .). 

2  i,j  4  1+1,3  i»J+l 


(20) 
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At  the  boundary  of  the  domain,  where  the  necessary  coarse-grid  points  do  not 
exist,  quadratic'  or  bilinear  extrapolation  is  used,  as  appropriate.  The 
cubic  interpolation  of  the  solutions  employed  by  Douglas  is  rather  unwieldy 
for  use  here,  since  the  interpolation  must  be  performed  at  every  fine-grid 
point,  in  two  dimensions.  In  his  implementation,  interpolation  must  be  done 
f  only  at  the  new  (non-coincident)  fine-grid  points,  and  then  primarily  only 

in  one  dimension,  x  or  y,  since  most  of  the  new  fine-grid  points  are 
collinear  with  the  coarse-grid  points. 

The  residual  projection  is  accomplished  by  taking  an  equally-weighted 
average  of  the  sixteen  nearest  fine-grid  neighbors  of  each  coarse-grid 
point.  Again  referring  to  Fig.  3,  the  residual  projection  at  (i,j)  is 
therefore  given  by 


2h 
r .  . 


2i+l  2j+l 

—  l  i  rh 

16  L  1  kr 

k=2i-2  £=2j-2 


(21) 


At  the  boundary  of  the  domain,  those  fine-grid  residuals  in  the  sum  which 

are  undefined  (belonging  to  fine-grid  points  beyond  the  boundary)  are  just 

omitted  from  the  average.  Projection  schemes  which  assign  a  larger 

weighting  factor  to  the  four  inner  neighbors  than  to  the  twelve  outer  ones 

did  not  perform  as  well  as  the  equal-weighting  scheme. 

A  further  issue  in  the  intergrid  transfers  not  considered  by  Douglas  is 

the  projection  of  the  coefficients  and  source  terms  in  the  difference 

equations.  Consider  the  following  2-dimensional  elliptic  boundary-value 

problem,  defined  on  the  rectangular  domain  D  =  [x  .  ,x  ]x[y  .  ,y  ]: 

*  5  &  min  max  J  min  J  max 
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+  3^Eu) 


5y 


<Su)  +  Tu  =  f 


(22) 


in  the  interior,  and 


au  + 


3u 

3^ 


=  c 


(23) 

r 


on  the  boundary  (n  is  the  coordinate  normal  to  the  boundary).  Subdividing 
the  domain  into  cells  with  mesh  spacings  Ax  and  Ay  and  finite- 

differencing  Eqs.  (22)  and  (23),  we  obtain 


lPi+l/2,3  <“1+1,3  -  “1.35  '  <Ui,j  -  "l-l^1  + 


Ay^ 


!Ql,j+l/2  (ui,j+l  -  "i,3>  '  Ql,j-l/2  <Ui,j  -  "lj-l”  + 


2S  IEi+l/2,3  (ui+l,3  *  “i,3>  '  Ei-l/2,j  <ul,j  *  "1-1,3 


.  (u,  .  +  u.  ,  . ) J  +  (24) 


2AT  Isi, 3+1/2  <"1,3+1-  *  ul,3)  -  Si, 3-1/2  <ui,3  "  ui,3-l))  + 


T.  .  u.  .  =  f .  .  , 

i,3  1,3  i,3 


together  with 


l/2,j 


(u 1,3  +U0,3}  + 


1/2, j 

Ax 


<ul,3  ‘  u0,3)  ’ 


1/2,3 


(25a) 


t 
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and 


N1+l/2,j 


(u.T  ,  .  .  +  u,T  . )  + 

N]_+l ,  J  NL  ,j 


^+1/2, j 

Ax 


(25b) 


=  c 


Nx+l/2,j 


at  x  =  x  .  and  x  ,  respectively,  and  with  corresponding  expressions  to 
min  max 

Eqs.  (25)  at  y  =  y  .  and  y 
n  J  J  nun  J  max 

The  quantities  with  half-integer  indices  in  Eqs.  (24)  and  (25)  are 
defined  at  the  cell  interfaces,  and  those  with  integer  indices  at  the  cell 
centers.  Each  of  the  coarse-grid  interfaces  is  coincident  with  two  fine- 
grid  interfaces  (cf.  Fig.  3).  The  projections  of  the  cell-interface 
coefficients  to  the  coarse  grid  are  therefore  carried  out  by  averaging  the 
two  values  on  the  fine  grid,  e.g., 


and 


p2h 

i+l/2,j 


,2* 


Jl  (pfr  +  p^1  ) 

2  v  2i+l/2,2j  21+1/2, 2j-l' 


-  fph  ph  X 

2  U2i-l/2,2j  +  P2i-3/2,2j-i; 


(26a) 


(26b) 


* 


The  projections  of  the  cell-center  coefficient  T  and  source  term  f  are  done 
by  averaging  the  values  at  the  four  surrounding  fine-grid  cell  centers, 

e.g. , 


■  7  <4-1,23-1  +  4-l.y  +  4,23-1  +  T2i,23>-  <27) 
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A  more  distributed  projection  of  the  source  f  (e.g. ,  the  sixteen-neighbor 

average  applied  to  the  residuals)  causes  a  loss  of  fine-scale  information 

■» 

and  degrades  the  accuracy  of  the  solution  obtained. 

The  multigrid  algorithm  as  described  thus  far  is  clearly  applicable  to 

problems  in  which  the  domain  is  uniformly  gridded.  In  many  applications, 

however,  nonuniform  gridding  in  one  or  both  dimensions  is  necessary  in  order 

to  guarantee  sufficient  spatial  resolution  in  some  fraction  of  the  domain 

without  requiring  a  prohibitively  large  number  of  unknowns  in  the  problem. 

A  simple  algebraic  transformation  is  available  which  converts  the  given 

finite-difference  equation  on  the  nonuniform  grid  into  a  related  equation 

on  a  uniform  grid,  and  the  latter  can  be  solved  by  multigrid  techniques. 

For  problems  in  which  the  grid  nonuniformities  are  mild,  the  resulting 

solution  to  the  uniform-grid  equation  may  be  a  satisfactory  approximation  to 

the  solution  to  the  original  nonuniform-grid  equation. 

The  transformation  to  the  uniform-grid  problem  is  found  from  the 

equivalent  of  Eqs.  (24)  and  (25)  on  the  nonuniformly  gridded  domain  D. 

Denote  the  x-direction  cell-interface  and  cell-center  locations  by 

(i  =  0,...,^)  and  x±  (i  =  0, . . .  .Nj+l),  respectively,  where  x1/2  =  xmin  and 

x  , =  x  :  the  cell  centers  x_  and  x„  thus  lie  beyond  the  boundary- 
N^  +  l/2  max  0  N^  +  l 

of  the  domain.  Define  the  local  cell-interface  and  cell-center  separations 
and  the  average  over  the  grid,  respectively,  by 


Ax. 

l 


Xi+l/2  Xi-l/2* 


(28a) 


Ax 


=  X 


i+1 


-  x. 


i+1/ 2 


(28b) 


and 


Ax 


(x 

max 


Sin>  1  N1 


(28c) 


Similar  expressions  hold  for  the  y— direction  grid  quantities* 


With  these 


definitions,  Eq.  (24)  generalizes  to 


L  .  r  /u  _  u  ) - h  (u  .  -  u  .)]  + 

Axi  Axi+l/2  (  1+1  *J  1,j  Axi— 1/ 2  1,3  1  1,3 


_L_  rji-dllZl  (u  -  u.  .)  - 
Ay,  Ay^,/o  1»j+1 


j  j+1/  2 


Ay, 


3-1/2 


(ui,j  + 


Ax 


—  [^i  (u^,  ,  +  »,  ^ ^  m  *  “i-i.j,1  ♦ 


1+1,  j  i,j 


(29) 


&7T  1%^  +  “i,3>  -  Sj^U1  (“i,j  -  "ij-i51  + 


T 


u 


> 


and  Eqs.  (25)  to 


1/2,  j 


1/2, j 


<ui,j  +U0,J)  +s 


1/2 


("l,3  '  “0,j* 


=  c 


1/2, j 


(30a) 


* 

* 


and 
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(30b) 


X+1/2J  Nj^+1/2 ,  j 

- o -  (uM  j_i  4  +  .)  +  — -  (u* 


"  uM  .) 


Nl+l»j  Nl»J  AxN1+1/2  Nl+1’j  Nl,j 


=  c 


N1+l/2,j* 


Ax_^  Ay . 

Now  multiply  Eq.  (29)  by  the  factors  — -  and  — and  define  the  following: 

Ax  Ay 


.*  _  Ax  Ayi  * 

?i+l/2,j  =  Pi+l/2,j  Qi  > j+1/2 

axi+l/2  7 


Ax^  Ay 


Ax  Ayj+l/2 


Qi, j+1/2 


*  -  Ayi 

Ri+l/2,j  H  Ri+l/2,j 


* 

s. 


Ax. 

l 


S.  . 


i, j+1/2  ~  aI  i>j+1/2 


* 

T. 


1,J  Ax  Ay 


Ax .  Ay , 

l  7  i 


A  V  A  TT  1  5  J 


* 
f . 


Ax^  Ay^ 
Ax  Ay 


f . 


’1/2, j 


Ax 


Ax 1/2  1/2’j 


Ax 


N^+l/2, j  Axn.+1/2  Nl+1/2»d 


a  =  a 


c  =  c 


The  result  of  these  operations  is  to  put  Eqs.  (29)  and  (30)  into  the  form  of 

_  —  * 

Eqs.  (24)  and  (25),  with  Ax  and  Ay  replaced  by  Ax  and  Ay  and  with  the 

coefficients  replaced  by  their  asterisked  counterparts.  f 
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A  final  note  on  implementation  concerns  Poisson  equations  on  a 
uniformly  gridded  domain.  For  this  special  case,  P=Q=1  and  R=S=T=0,  the 
coefficients  in  the  finite-difference  equation,  Eq.  (24),  are  independent  of 
the  indices  i  and  j.  Because  of  this,  the  core  storage  required  to  solve 
the  problem  is  vastly  reduced  and  the  operation  count  for  the  solve  is  also 
decreased  significantly.  These  advantages  have  been  exploited  in  a  solver 
specifically  designed  to  solve  Poisson  equations  on  a  uniform  grid.  When 
solutions  to  Poisson  equations  on  a  nonuniform  grid  are  required,  the 
transformations  above  can  be  carried  out  and  a  general  elliptic  equation 
solver  used. 


* 

% 
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NUMERICAL  TESTS 


The  multigrid  algorithm  proposed  by  Douglas  [1982]  and  illustrated 
in  Fig.  1(d)  has  been  vectorized  and  implemented  as  described  in  the 
preceding  two  sections,  in  a  FORTRAN  program  for  use  on  the  Naval  Research 
Laboratory1 s  Texas  Instruments  Advanced  Scientific  Computer  (ASC)  [Texas 
Instruments  Incorporated  1975].  The  ASC  is  a  32-bit  word,  2-pipe  vector 
computer  with  one  million  words  of  high-speed  central  memory.  Each  of  the 
two  pipes,  which  consist  of  one  memory  buffer  unit  and  one  arithmetic  unit 
per  pipe,  can  produce  eight  words  of  output  data  per  80  ns  central  processor 
clock  cycle,  under  optimum  conditions.  The  memory  control  unit  can  access 
one  eight-word  octet  of  input  data  per  160  ns  central  memory  clock  cycle. 

At  peak  efficiency,  the  ASC  is  capable  of  retrieving  50  million  operands  per 
second  and,  thus,  of  generating  up  to  50  million  results  per  second.  Since 
the  two  pipes  are  in  principle  able  to  execute  as  many  as  200  million 
operations  per  second,  for  highly  vectorized  code  the  execution  speed  is 
limited  by  the  memory  bandwidth.  A  premium  is  therefore  placed  on  the 
efficient  retrieval  and  storage  of  data  in  optimizing  ASC  software. 

The  limiting  of  the  execution  speed  by  the  memory  bandwidth  influences 
the  selection  of  the  relaxation  algorithm,  as  discussed  earlier.  The  red- 
black  relaxation  makes  use  of  one— half  of  the  eight  words  in  each  octet 
retrieved  from  memory,  and  as  a  result  executes  at  roughly  half  of  the  speed 
of  the  Jacobi  relaxation,  which  makes  use  of  the  entire  octet.  All  of  the 
results  presented  here  were  obtained  using  the  Jacobi  relaxation. 

The  direct  solver  for  these  tests  uses  a  Crout  lower-upper  triangular 


decomposition  of  the  matrix.  Parts  of  the  solver  are  vectorized,  and  it 
takes  advantage  of  the  banded  structure  of  the  matrices  arising  from  partial 
differential  equations.  The  solver  was  adapted  from  a  routine  written  by 
Winsor  [ 1976 ] . 

The  first  set  of  test  results  is  for  the  boundary-value  problem  on  the 
unit  square, 


3  ,  xy  3uN  .  3  ,  xy  3uN  ,  3  ,  N  ,  9  ,  *  /  2  .  2\ 

S<e  +  s?<e  s?’  3^yu)  3?(x“)  (x  y  )u 


=  f 


(31) 


in  the  interior,  and 


u  =  0 


(32) 


on  the  boundary.  The  solution  is  chosen  to  be 

u(x,y)  =  sin(2Trx)sin(iTy) ,  (33) 

and  the  source  term  f(x,y)  is  defined  by  substituting  this  solution  into  the 
left  side  of  Eq.  (31).  Some  results  are  summarized  in  Table  1.  There,  for 
three  different  grid  resolutions  (16x16,  32x32,  and  64x64)  and  for  L-level, 
C-cycle  solves  for  several  values  of  L  and  C,  are  tabulated  the  number  of 
floating-point  operations  performed  per  fine-grid  point,  the  solve  time  per 
point  in  ps,  and  the  maximum  absolute  error  between  the  numerical  solution 
and  the  exact  solution  (Eq.  (33))  evaluated  at  the  grid  points.  Also 
included  is  the  storage  required  by  the  solver  for  each  resolution  and  each 
value  of  L,  given  as  the  number  of  fine-grid  arrays  used.  The  results  of 
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direct  solves  on  the  finest  grid  (L-C— 1)  are  tabulated  for  the  two  lower 
resolution  cases  ( 16x16  and  32x32),  there  being  insufficient  memory  to  carry 
out  the  direct  solve  for  the  highest  resolution  case  (64x64).  For  the  first 
C-l  multigrid  cycles,  four  Jacobi  relaxation  sweeps  were  done  after  each 
interpolation  (m=4  in  Fig.  1(d));  the  error  decreases  only  very  slowly  with 

f 

additional  sweeps,  but  increases  quite  rapidly  with  fewer  sweeps. 

The  core  storage  M  and  the  operation  count  W  for  the  multigrid  solver  ^ 

asymptote  as  L  increases  in  Table  1.  At  each  level,  five  arrays  of  storage 
are  needed  for  the  five  bands  of  coefficients  in  the  matrices,  and  one  each 
is  needed  for  the  source  term  and  for  the  solution.  An  additional  array  at 
each  level  is  used  to  store  the  reciprocals  of  the  matrix  diagonal  elements. 

This  allows  a  (slow)  divide  to  be  replaced  by  a  (fast)  multiply  in  each 
relaxation  iteration.  Two  further  fine-grid  arrays  are  needed,  one  for  the 
residual  calculation  and  another  as  general-purpose  work  space.  The  total 
storage  required  therefore  asymptotes  to 

M  =  (1  +  x  +  ...)‘8  +  2  =  -|«8  +  2  =  12.7. 

The  asymptotic  operation  count  per  cell  for  the  matrix  generation  is  35,  and 
for  the  solve  itself  is  120  for  the  2-cycle  solve  and  405  for  the  3-cycle 
solve.  The  total  operation  counts  therefore  asymptote  to 

W  =  35  +  120  =  155 

C-2  * 

and 

W  =  35  +  405  =  440  f 
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TABLE  1.  Elliptic  test  problem  results  obtained  on  (a)  16x16,  (b)  32x32,  and 
(c)  64x64  grids.  For  each  L-level,  C-cycle  solve  (L=C=1  is  direct 
solve),  the  operation  count  per  cell,  the  solve  time  in  ys  per 
cell,  and  the  maximum  absolute  error  are  tabulated.  For  each  L, 
the  core  storage  required  in  number  of  fine-grid  arrays  is  given. 
Insufficient  memory  was  available  to  solve  the  64x64  problem 
directly  (*). 


Cycles 

i 

2 

Levels 

3 

4 

5 

i 

617 

120 

2.9xl0-5 

operations /cell 
y s/cell 
maximum  error 

2 

184 

60 

4.7xl0-3 

139 

51 

6.7xl0~3 

3 

254 

95 

1 . 5x 10-3 

Storage 

47.1 

17.7 

14.1 

i 

2233 

266 

8.3xl0"5 

2 

396 

70 

1.5xl0-3 

161 

34 

3.0X10-3 

147 

31 

3.5xl0"3 

3 

307 

66 

2.6xl0“4 

305 

77 

3.2xl0-4 

Storage 

78.6 

21.0 

13.8 

13.5 

i 

8538 

* 

* 

2 

1204 

141 

3.3xl0“4 

241 

36 

6.9X10-4 

156 

23 

9.2X10-4 

151 

22 

1.0X10-3 

3 

494 

74 

5.6xl0-5 

342 

56 

4.6xl0-5 

345 

66 

6.6X10-5 

Storage 

142.3 

28.6 

14.2 

13.0 

13.1 
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The  tabulated  values  for  the  latter  are  significantly  smaller,  owing  to  the 
slow  convergence  of  the  operation  count  to  its  asymptotic  limit  as  L 
increases . 

The  advantages  of  the  multigrid  solve  over  the  direct  solve  in  core 
storage,  operation  count,  and  execution  time  are  clearly  evident  for  all 
three  cases,  particularly  the  two  higher  resolution  problems.  Although 
the  16x16  multigrid  solves  are  substantially  less  accurate  than  the  direct 
solve,  the  3-cycle  solves  on  the  32x32  grid  are  comparable  in  accuracy  to 
the  direct  solve,  and  those  on  the  64x64  grid  exceed  the  inferred  accuracy 
of  the  direct  solve.  The  accuracy  of  the  multigrid  solutions  varies  rather 
slowly  with  L,  but  improves  markedly  with  an  increase  in  C,  at  the  price  of 
an  increase  of  roughly  a  factor  of  two  in  operation  count  and  execution  time 
for  the  examples  shown.  The  effect  of  decreasing  vector  length  on  the  speed 
of  execution  of  the  algorithm  is  apparent  in  the  3-cycle  solves  for  the 
32x32  and  64x64  problems.  For  L  -  3  and  4  and  L  =  4  and  5,  respectively, 
the  operation  count  is  essentially  identical,  but  the  execution  time  is 
greater  by  over  15%,  for  the  larger  values  of  L  compared  to  the  smaller 
values.  For  the  2-cycle  solves,  on  the  other  hand,  the  execution  time  is 
slightly  smaller  for  the  larger  values  of  L.  These  results  suggest  the 
simple  rule  of  thumb  that  the  coarsest  grid  used  be  approximately  8x8,  for 
either  2-  or  3-cycle  solves,  in  order  to  roughly  minimize  the  execution 
time.  Then,  for  example,  a  diffusion  problem  in  1000  unknowns  can  be 
accurately  solved  for  1000  timesteps  at  just  over  one  minute  of  CPU  time. 

The  second  set  of  test  results,  summarized  in  Table  2,  is  for  the 
Poisson  boundary-value  problem 


% 


# 

% 


TABLE  2.  Poisson  test  problem  results  obtained  on  (a)  16x16,  (b)  32x32,  and 
(c)  64x64  grids*  For  each  L-level,  C-cycle  solve  (L=C=1  is  direct 
solve),  the  operation  count  per  cell,  the  solve  time  in  y s  per 
cell,  and  the  maximum  absolute  error  are  tabulated.  For  each  L, 
the  core  storage  required  in  number  of  fine-grid  arrays  is  given. 
Insufficient  memory  was  available  to  solve  the  64x64  problem 
directly  (*) . 


Cycles 
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2 
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3 

4 
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595 

113 

2.9xl0-5 
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ys/cell 
maximum  error 

2 

147 

53 

4.7xl0-3 

99 

45 

5.9xl0-3 

3 

208 

95 

1 . 2x 10~3 

Storage 

40.8 

9.5 

5.4 

i 

2210 

262 

8.3xl0~5 

2 

357 

65 

1.5xl0-3 

117 

28 

2.8X10-3 

102 

27 

3.2xl0-3 

3 

250 

61 

2.3xl0“4 

246 

77 

2.8X10-4 

Storage 

72.4 

13.1 

5.4 

5.1 

i 

8514 

* 

* 

2 

1164 

137 

3. 2x  10-l+ 

194 

31 

6.5X10-4 

108 

18 

8.8xl0-4 

103 

18 

1.0X10-3 

3 

431 

69 

5.8xl0-5 

273 

52 

3.4xl0-5 

274 

65 

4.3X10-5 

Storage 

136.2 

20.9 

6.2 

4.9 

4.9 
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on  the  unit  square,  again  with  the  boundary  condition 


and  the  solution 


u  =  0 


u(x,y)  =  sin(2rx)sin(iry) . 


(35) 


(36) 


As  before,  four  Jacobi  relaxation  sweeps  were  performed  after  each 
interpolation  in  the  first  C-l  multigrid  cycles. 

The  core  storage  required  by  the  Poisson  solver  is  smaller  by  six 
arrays  at  each  level  —  the  five  bands  of  coefficients  in  the  matrices  and 
the  reciprocals  of  the  diagonal  elements  -  so  that  it  asymptotes  to 

M  =  2  +  2  =■  4.7. 

The  operation  count  for  the  matrix  generation  is  negligible,  and  the 

* 

asymptotic  operation  count  per  cell  for  the  solve  is  reduced  to 


"c-2  '  100 

and 

Wc=3  =  330. 

The  savings  come  in  the  relaxation  iterations  and  the  residual  calculations 
and  arise  from  the  symmetry  of  the  matrices  about  the  diagonal.  This 
reduction  in  the  operation  count  is  not,  however,  matched  by  a  comparable 


* 
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reduction  in  the  execution  time,  A  breakdown  of  the  timing  by  components 
shows  that  the  relaxation  iterations  and  residual  calculations,  although 
they  require  fewer  operations,  require  about  as  much  time  to  execute  as 
for  the  elliptic  solver.  This  is  due  to  a  combination  of  less  efficient 
retrieval  and  storage  of  data  and  of  more  effort  expended  on  short  vectors 
by  these  components  of  the  Poisson  solver.  The  benefit  of  eliminating  the 
storage  of  the  bands  of  matrix  coefficients  is  obtained  at  the  cost  of 
making  these  portions  of  the  solver  more  cumbersome  and  slow.  Clearly  a 
net  gain  is  realized,  however,  primarily  in  the  amount  of  core  required, 
though  secondarily  also  in  the  total  execution  time. 
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CONCLUSIONS 


The  numerical  tests  summarized  in  the  preceding  section  demonstrate 
conclusively  the  power  of  multigrid  techniques  in  obtaining  numerical 
solutions  to  elliptic  boundary-value  problems.  These  methods  may  make 
practical  the  solution  of  problems  which  heretofore  would  have  been 
considered  prohibitively  expensive  or  even  impossible  to  attempt, 
particularly  when  extended  to  a  larger  number  of  spatial  dimensions. 

It  may  also  prove  possible  to  implement  multigrid  techniques  with  a 
sufficient  degree  of  parallelism  to  make  them  highly  effective  for 
use  on  systems  of  parallel  processors.  Perhaps  the  major  drawback 
of  these  methods  is  the  large  startup  effort  required.  The  complexity 
of  multigrid  algorithms,  with  their  intergrid  transfers  and  recursive 
cycling,  makes  the  development  and  debugging  of  a  working  program  a 
time-consuming  task.  The  potential  payoff  is  sufficiently  great,  however, 
to  warrant  the  investment. 
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